ABSTRACT

The effects of incentive conditions on the results of
a confidence test were investigated. Two hundred thirty high school
subjects were administered a very difficult confidence scored test
under two conditions: 1) that the test would count heavily on their
grades (incentive condition) and 2) that the test was for research
purposes and would not be counted (relaxed condition). An analysis of
the data revealed: 1) Under incentive conditions, scores on
confidence tests are higher, and reliability significantly lower when
compared to the relaxed conditions. 2) Females have a greater
tendency toward taking extreme positions than males, especially in
the incentive condition. 3) Subjects in the incentive group liked the
test better, had more of a tendency to take extreme positions, and
made more appropriate estimates of their degree of confidence. 4)
Middle socioeconomic subjects, compared to both upper and lower
socioeconomic subjects, made higher scores and more appropriate
estimates of confidence. 5) High scoring subjects gambled more on
difficult items under the relaxed condition, but gambled less on
difficult items in the incentive condition. 6) Positive attitudes
toward the tests were directly related to degree of confidence.
(Author/HLP)






CONFIDENCE TEST SCORING AND INCENTIVE CONDITIONS






A Paper Presented to the NCME
April 16, 1974 
Chicago, Illinois 



Robert M. Rippey, Professor
University of Connecticut 
Schools of Medicine & Dental Medicine 
Farmington, Connecticut 06032 



Confidence testing asks subjects to assign probabilities of confidence to
the options of multiple choice items. Considerable disputation has arisen over
the importance and efficacy of these procedures (Hambleton, Roberts, and Traub,
1970; Rippey, 1970), and a summary of some of the arguments is contained in
Wang and Stanley (1970). My own continued interest in confidence testing lies not
in the area of the alleged improved psychometric properties of confidence tests,
but in the area outlined rather early by DeFinetti: How do we get persons to
become better assessors of their own confidence?

Accurate assessments of confidence are especially important in areas
involving incomplete knowledge of data, and in areas where important decisions
must be based on an inadequate body of theory. Some of the early work in
confidence testing was based on utility theory. Scoring functions were developed
which produced maximum scores in the long run if and only if the subject maximized
his expected utility, given a knowledge of the payoff of his choices (Shuford,
Albert, and Massengill, 1966). Unfortunately, one man's utility is sometimes
another man's poison. There are differences in sex, social class, and condition
of administration which interact with item difficulty and contribute to error
variance in the confidence testing situation.
Two hundred sixty-three sophomore and junior students from a high school
in a suburb of Chicago were randomly assigned to two groups and administered
fifteen very difficult items from the STEP Writing Test, Level 1. Ss were
told that the items might or might not have unique correct responses. 1/ One
group was told that the test they were taking would count toward their grades
in English. The other group was told that the test was being administered for
research purposes and would not be counted on their grades. The teachers were
given the grades of the subjects in the incentive group, and they had agreed to
utilize them in grading, although the amount of weight to be given to the results
was not specified. Ss were instructed in the system of scoring to be used as
follows:



1/ Permission to use this test was granted by the Educational Testing
Service.



Each of the questions in this test is followed by suggested answers.
Assign a number from 0 to 9 to each suggested answer, depending on
how strongly you feel that the answer is correct. If you believe that
only one suggested answer is correct, mark that answer with a 9 and
mark the other(s) with zeros. If you like the suggested answers
equally, assign the same number to each. The sum of the three
responses should add up to 9 . . .

If your answer is closer to the right answer, you will get a positive
score. If it is closer to the wrong answer you will get a negative
score. The scores vary from -1 to +1. They are multiplied by your
certainty (C).



The test itself was preceded by a six-item practice test, at the end of
which subjects were given the right answer for each question and could ask any
question about the instructions. They were told that for the practice test
there was one single right answer, but for the test itself, there might or
might not be more than one single right answer to each item. The items were
scored using the Weighted Euclidean function

$$S = C\left(1 - \frac{2D}{D_{\max}}\right)$$

where:

C = Confidence (0 ≤ C ≤ 9)

D = Distance from S's response to the criterion group response

D_max = Maximum distance attainable from the criterion group response.
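The scoring function above can be illustrated with a short sketch. The paper does not spell out how D and D_max were computed, so the code below makes two loud assumptions: D is the ordinary Euclidean distance between the subject's three-number response and the criterion response, and D_max is the distance from the criterion to the farthest "all points on one option" response. The function name and arguments are hypothetical.

```python
import math


def weighted_euclidean_score(response, criterion, confidence):
    """Sketch of S = C * (1 - 2 * D / D_max) for one item.

    Assumptions (not stated in the paper): D is plain Euclidean distance,
    and D_max is taken over responses that put every point on one option.
    """
    points = sum(response)  # 9 in the study's instructions
    n_options = len(response)

    # D: distance from the subject's response to the criterion response.
    d = math.dist(response, criterion)

    # D_max (assumed): farthest single-option response from the criterion.
    corners = [
        tuple(points if k == j else 0 for k in range(n_options))
        for j in range(n_options)
    ]
    d_max = max(math.dist(corner, criterion) for corner in corners)

    # The bracketed term runs from -1 (farthest) to +1 (on the criterion).
    return confidence * (1 - 2 * d / d_max)


# A subject who matches a (9, 0, 0) criterion with full confidence,
# and one who puts everything on a different option.
best = weighted_euclidean_score((9, 0, 0), (9, 0, 0), confidence=9)
worst = weighted_euclidean_score((0, 9, 0), (9, 0, 0), confidence=9)
```

Under these assumptions the unweighted term matches the instructions given to subjects: it varies from -1 to +1 and is then multiplied by the stated certainty.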



Ss were asked to fill out a personal data sheet, and were given a test of 5
personality variables. 2/ From these instruments the following variables were
measured:

1. Sex: Male = 1, Female = 2
2. Year in School: 1 = Sophomore, 2 = Junior
3. Score: Mean weighted Euclidean score on the 15 item writing test
4. Attitude: 0 = maximum dislike for test, 9 = maximum liking
5. Confidence: 0 = minimum confidence in responses, 9 = maximum
6. Autonomy: Scale score from Personality Research Inventory
7. Harm Avoidance: Personality Research Inventory
8. Impulsivity: Scale from Personality Research Inventory
9. Order: Scale from Personality Research Inventory
10. Succorance: Scale from Personality Research Inventory
11. Social Class: (on a three-point scale) Low = 1, Middle = 2, Upper = 3
12. Appropriateness of Confidence (WPLN)
13. Propensity to gamble (PLN)
14. Appropriateness of Confidence on an item of medium difficulty
15. Gambling propensity on an item of medium difficulty
16. Appropriateness of Confidence on an easy item, #7
17. Gambling propensity on an easy item
18. Appropriateness on a difficult item, #13
19. Gambling propensity on a difficult item



2/ Scales Au, Ha, In, Or, Su, from Douglas Jackson, Personality Research
Form, Form AA, Research Psychologists Press, Inc., 1965.
Some explanation is necessary on the computation of variables 12 through
19.

The propensity to gamble, PLN, for an item was equal to the sum of the
squares of the differences between the numerical response for each of the
options and three, divided by six. That is, for the i-th item,

$$\mathrm{PLN}_i = \frac{1}{6} \sum_{j=1}^{3} (r_j - 3)^2$$

where $0 \le r_j \le 9$, $\sum_{j=1}^{3} r_j = 9$, and j = option number.



Since subject responses ranged from 0 to 9 for the three options, Ss
who had no preference for the options, and who expressed this lack of preference
by responding (3,3,3) to the three options, would receive a PLN equal to zero. On
the other hand, an S showing a complete preference for a single option (propensity
to gamble) would receive PLN = (36 + 9 + 9)/6 = 9. Thus PLN is an index of the
subject's tendency to select a single option. PLN for a test would then consist
of the average value of PLN over all the items.
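The PLN computation described above is simple enough to check by hand or in a few lines of code. The following is a minimal sketch; the function names are hypothetical, and the constants 3 and 6 come directly from the definition in the text.

```python
def pln_item(response):
    """Propensity to gamble for one item.

    `response` holds the three numbers (0 to 9, summing to 9) that a
    subject assigned to the item's options.
    """
    return sum((r - 3) ** 2 for r in response) / 6


def pln_test(responses):
    """PLN for a whole test: the average of the item-level PLN values."""
    values = [pln_item(r) for r in responses]
    return sum(values) / len(values)


no_preference = pln_item((3, 3, 3))   # even spread over the options
full_gamble = pln_item((9, 0, 0))     # everything on a single option
two_item_test = pln_test([(3, 3, 3), (9, 0, 0)])
```

The two single-item calls reproduce the endpoints given in the text: 0 for the (3,3,3) response and (36 + 9 + 9)/6 = 9 for a response that places all nine points on one option.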

Appropriateness of confidence compares S's PLN with his expressed confidence
in the item. For the i-th item, appropriateness of confidence (WPLN) is the
absolute value of the difference between S's PLN for that item and his confidence
measure, C_i:

$$\mathrm{WPLN}_i = \left| \mathrm{PLN}_i - C_i \right|$$

Theoretically, a person with no knowledge should declare C_i = 0 and distribute
his responses (3,3,3). This would make PLN = 0 and C = 0. Thus a score of 0
on WPLN indicates congruence between PLN and C_i. An S who is certain of his
response would mark one option with a nine and the other options with zeroes.
This would make PLN = 9. If he was that certain, he should also mark C = 9,
again giving WPLN = 0. Positive values of WPLN indicate a discrepancy between
confidence and one's behavior in distributing his responses.
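As a small illustration of the appropriateness index, the sketch below computes WPLN for the two congruent cases described in the text and for one incongruent case. Function names are hypothetical; the formulas follow the definitions of PLN and WPLN given in the text.

```python
def pln_item(response):
    """Propensity to gamble: sum of squared deviations from 3, divided by 6."""
    return sum((r - 3) ** 2 for r in response) / 6


def wpln_item(response, confidence):
    """Appropriateness of confidence: |PLN - C| for a single item.

    Zero means the stated confidence matches the spread of the response;
    larger values mean a larger mismatch.
    """
    return abs(pln_item(response) - confidence)


no_knowledge = wpln_item((3, 3, 3), confidence=0)   # congruent
certain = wpln_item((9, 0, 0), confidence=9)        # congruent
mismatch = wpln_item((9, 0, 0), confidence=3)       # gambles, but unsure
```

Note that a low WPLN is the desirable outcome: it is a discrepancy score, not a quality score.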

Means and standard deviations of the 19 variables under the relaxed and 
the incentive conditions are shown in Tables 1 and 2. 

The reliability of the test under the incentive condition was 0.261.
Under the relaxed condition it was 0.[?]93. Although the mean scores were
significantly higher under the incentive condition, the reliability of these
scores was consistently lower. Although these reliabilities may seem low, it
must be remembered that the items were only 1/4 of the items from the original
test. When corrected for length, reliabilities are close to the published values.
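The paper does not say which length correction was applied. A common choice is the Spearman-Brown prophecy formula, and the sketch below assumes it purely for illustration; the function name is hypothetical and the lengthening factor of 4 follows from the statement that the items were one quarter of the original test.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability of a test lengthened by `length_factor`.

    Spearman-Brown prophecy formula: k * r / (1 + (k - 1) * r).
    Assumed here; the paper does not name its correction method.
    """
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


# Incentive-condition reliability of the 15-item test, projected to a
# test four times as long (roughly 0.59 under this assumption).
projected = spearman_brown(0.261, 4)
```

The projection shows why a short-form reliability that looks low can still be compatible with a much higher value for the full-length instrument.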

Ss reported a slightly more favorable attitude toward the test under the
incentive condition. Although the average liking in both groups was low, there
was a significantly greater amount of confidence in the incentive group than
there was in the relaxed group, along with a significantly higher propensity
to improve their score. The confidence expressed in the incentive group was
more congruent with their distribution of preference than was the confidence
expressed by the relaxed group. Confidence was most appropriate on the easy
item, and was least appropriate on the item of moderate difficulty.

Using data shown in Table 3, Grozelier (1970) concluded that girls were
slightly more sensitive to the incentive effect than were boys. With regard
to the level of risk-taking, boys were rather conservative and girls
high-risk oriented. This would follow from an assumption that the motive to
achieve success would be stronger among boys, whereas girls would rather be
failure-avoidance oriented.
[Tables 1 and 2: means and standard deviations of the 19 variables under the relaxed and incentive conditions]

TABLE 3 - GROUP MEANS

TOTAL    SEX    GRADE    SOCIAL CLASSES

Social Class

On item 6, higher class subjects appear to be the most conservative.
This was particularly conspicuous under the incentive condition (PLN_6 mean =
5.1 for the higher class, PLN_6 mean = 5.9 for the middle class, PLN_6 mean =
6.1 for the lower class).

Middle class subjects appeared as moderate risk takers and appeared as
motivated to achieve success, whereas lower class subjects were "fear of
failure" oriented.
Middle class students received slightly higher scores than the two other
classes (though the difference was not statistically significant). They tended
to display a motivation to achieve success.

Lower class students fared the worst on this test. They were the most risk
minded and therefore obtained the lowest scores, because confidence testing
penalizes guessing and rewards the acknowledgement of partial knowledge.

Correlations were computed for each of the two samples for all 19 variables.
The correlation matrices are shown in Tables 4 and 5. Correlations larger than
r = .195 will be examined. For a single pair of variables, a correlation of
0.195 indicates a significant departure from 0.0 at the 0.025 level with 100
degrees of freedom (Walker and Lev, 1953). Comparing significant correlations
in the two matrices, it can be seen that there was a significant relationship
between sex and attitude toward the test, with the girls liking it better than
the boys. This sex difference was accentuated under the incentive condition.
The males were more Autonomous and less Succorant in both groups. This should
be expected because the personality test was not involved in the incentive
instructions. Finally, only the difficult item provided a significant correlation
with appropriateness of judgment of confidence and the propensity to gamble,
with the females showing a greater willingness to make extreme choices, and
also exhibiting greater congruence between their feelings of certainty and their
behavior in responding to the items. That is, the females were more inclined
to choose single responses, but they also felt more certain about their choices
than did the males. Confidence was significantly related to score under both
conditions, though the relationship was higher under the relaxed condition.
That is, subjects were more willing to take extreme positions under the relaxed
condition.
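The cutoff of r = .195 quoted from Walker and Lev can be checked with the usual conversion of a correlation to a t statistic. The sketch below is illustrative only: the function name is hypothetical, and the critical value of 1.984 is the standard tabled t for 100 degrees of freedom with 0.025 in one tail, which is assumed to be the sense of "0.025 level" here.

```python
import math


def t_from_r(r, df):
    """t statistic for testing a correlation against zero.

    t = r * sqrt(df / (1 - r^2)), where df = n - 2.
    """
    return r * math.sqrt(df / (1 - r * r))


# Tabled critical t for df = 100 with 0.025 in one tail (assumed reading
# of the paper's "0.025 level").
T_CRITICAL = 1.984

t_at_cutoff = t_from_r(0.195, 100)
exceeds_critical = t_at_cutoff > T_CRITICAL
```

Because the cutoff applies to one pair of variables at a time, scanning a 19-variable matrix with it will flag some correlations by chance alone, which is consistent with the paper's phrasing "for a single pair of variables".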
Thus the subjects seemed to be more motivated by fear of failure than by
potential reward. It is of additional interest to note that there was
no relationship between score and the gamble score on the easy items in
the incentive condition, while the significant relationship was on the
hard item under the incentive condition. In fact, the correlation between
the gamble score on the hard item and the score changed sign going from the
relaxed condition to the incentive condition. That is, for the high scoring
Ss, there was a tendency to assume extreme positions on the hard items under
the relaxed condition, but an unwillingness to do so under the incentive
condition. That is, where grades were at stake, the high scoring Ss played
the cautious role. S's attitude toward the test was related primarily to his
confidence, although there was also a significant relationship with the
gamble score in the incentive condition. Confidence was significantly
related to inappropriateness of judgment and to willingness to take extreme
positions under the incentive condition. That is, under the incentive,
subjects who were confident about their responses were more willing to
take extreme positions in responding. However, these extreme positions
did not match their degrees of confidence very well. Several other of
the item scores were related to confidence in the relaxed condition, while
the gamble score became less important. The personality variables showed
substantial intercorrelations, as did the cluster of gamble and appropriateness
scores. The significant negative correlations between the gamble and the
appropriateness scores are due to the fact that these two scores are not
independent of one another. The negative sign becomes obvious when one
examines the means of computation of the appropriateness score (WPLN) from
the gamble score (PLN).

In order to better understand what variables contributed to S's
expression of confidence, a regression analysis was performed. No
significant regression held between confidence and any other variables,
although high succorance and low harm avoidance did contribute a small
amount to the prediction of confidence in the relaxed condition only.

Seventeen of the scores were factor analyzed. The PLN and WPLN
variables for the item of medium difficulty were left out since they did
not seem to provide much information. A principal components analysis
was first performed. Then the principal components were rotated according
to the following specifications: a maximum of nine factors were to be
extracted, the lower limit of eigenroots was set at 1.00, and no factors
were to have loadings of less than .30 for at least one variable.
According to these specifications, seven factors were rotated. Ten
rotations were required in the incentive condition. Thirteen were required
in the relaxed condition. The factor matrix is shown in Tables 6 and 7.
Loadings in excess of 0.30 are underlined.

In interpreting these results, it should be recalled that a low
numerical score on the Appropriateness variable means that a person's
responses were congruent with his confidence. The factor analysis did
not reveal much about confidence, except to underline the fact that there
is a dependence between it and the gamble and appropriateness measures. This
is illustrated in Factor 1 in both conditions. Factor 2 is made up of sex
and several personality variables. Attitude is also a relevant variable
in the relaxed condition, but the importance of attitude in this factor is
much reduced in the incentive condition. Factor 3 in the relaxed condition and
Factor 5 in the incentive condition are quite similar and are made up entirely
of the personality factors. Factor 4 is perhaps the only one of much interest.
It shows a relationship among sex and the way in which Ss deal with the difficult
items. This, however, only confirms what has been previously said about sex
differences with respect to making dogmatic choices on items.

Conclusions 

The findings of this study are summarized as follows: 

1. Under incentive conditions, scores on confidence tests are higher, and 
reliability significantly lower when compared to the relaxed condition. 

2. Females have a greater tendency toward taking extreme positions than
males, especially in the incentive condition.

3. Subjects in the incentive group liked the test better, had more of a 
tendency to take extreme positions, and made more appropriate estimates of their 
confidence. 

4. Middle SES subjects, compared to both upper and lower SES subjects,
made higher scores and more appropriate estimates of confidence. They seemed 
to be motivated more by desire for success than fear of failure. 

5. High scoring subjects gambled more on difficult items under the relaxed 
condition, but gambled less on difficult items in the incentive condition. 

6. Liking of tests was directly related to confidence. 

7. There was no significant regression between confidence and the battery
of personality variables, although high succorance and low harm avoidance made
small contributions to prediction.



Much work remains to be done in studying confidence testing. Although it 
is clear that technical improvements may be made in the reliability and validity 
of tests through confidence scores, it is also clear that subjects do not handle 
their confidence uniformly. What is confidence to one may be hazard to another. 
As Wang and Stanley (1970) state:

"The derivation of optimum response strategies in multiple choice
testing represents an application of mathematical decision theory
which underscores the decision process inherent in such tests.
The success of testing procedures which attempt to control the
decision process will be critically dependent on the ability of
subjects to effectively use optimal strategies. It is not certain
that all subjects are equally capable of learning to use such
strategies."

Understanding optimal strategies of probability assessment is likely to be
the most significant outcome of further research on confidence testing. Although
Bruner (1956) pointed out two basic differences in the way subjects use their
confidences, the sentry condition and the accuracy condition, and demonstrated
empirical evidence of these two modes of behavior, there are other complex
conditions which intervene between a subjective probability and a decision or
action. Since it is possible, although not guaranteed, that one may assess
subjective probabilities accurately by means of reproducing scoring functions,
some basic research steps are needed. First, subjects in experiments need
experience in utilizing confidence testing. It takes a while to learn to respond
intelligently to the rules of that game. Second, the possibility of applying
the relative operating characteristic to confidence testing needs to be explored
(Swets, 1973). Once a more valid interpretation of subjective probabilities
is available, further study might be made of the use of optimal strategies by
subjects in problematic situations. Such strategies would perhaps start with
what is known about optimal search procedures in polychotomic trees (Watanabe,
1969).

A sizeable field in this area remains unplowed. How do students react to
problematic situations? Are students able to assess their state of information
and respond intelligently to it? Do our teaching and testing practices make
them aware that there are differences among the ways we use our information?
And to repeat DeFinetti, "How can we become better probability assessors?"